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Abstract 

Computing quark propagators with overlap fermions requires the solution of a shifted 
unitary linear system. Jagels and Reichel have shown that for such systems it is possible to 
construct a minimal residual algorithm by short recurrences. The Jiilich-Wuppertal group 
have found this algorithm to be the fastest among overlap solvers. In this paper we present 
a three-term recurrence for the Arnoldi unitary process. Using the new recurrence we con- 
struct a minimal residual solver which is the fastest among all Krylov subspace algorithms 
considered so far for the overlap inversion. 
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1 Introduction 



Lattice theories with chiral quarks provide an accurate tool for studying the physics of strong in- 
teractions. The physical information of these theories is contained in quark propagators, which 
are then combined to construct meson, nucleon and even more complicated elementary particle 
propagators. Quark propagator computations amount to solving large linear systems of the type: 

Dx = b (1.1) 

where D G C NxN is a dense matrix operator representing the chiral Dirac operator on a regular 
four dimensional space-time lattice, x, b G C N are the quark propagator and its source. 

More than a decade ago two formulations of chiral fermion theories were discovered: the 
domain wall fermions [1,2] and overlap fermions [3]. These apparently different formulations 
are closely related to each other [4]. In particular, the truncated overlap variant of domain wall 
fermions [5] can be shown to be equivalent to overlap fermions in four dimensions at any lattice 
spacing [6]. Therefore, for computational purposes it is sufficient to consider the Neuberger 
Dirac operator, which is a shifted unitary matrix of the form [7]: 

D = Cl l + c 2 V (1.2) 

where 

V = DwiD^Dw)- 1 ' 2 (1.3) 

ci = (l + m)/2 (1.4) 

c 2 = (l-m)/2 (1.5) 

Here Dw is the Wilson-Dirac operator, which is a non-Hermitian sparse matrix. It is easy to see 
why V is a unitary matrix: starting from the singular value decomposition of D w = XTY*, 
one gets V = XY*. 

Another way to look into the Neuberger operator is to use operators T 5 =diag(i~7v/2, —In/2) 
1 and f 5 =sign(H w ), where H w = T 5 D W is the Hermitian Wilson-Dirac operator. In this case 
'We assume a Dirac spinor ordering compatible with the definition of T^. 
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it is easy to see that the overlap operator takes the form: 

D = Cl l + c 2 r 5 f 5 



(1.6) 



This representation can be used to show that D*D commutes with T 5 . As a result, D*D is 
block diagonal in the chiral basis of T 5 . This observation has important applications: if we 
would like to solve linear systems with coefficient matrix D*D we can use Conjugate Gradients 
(CG) algorithm, which we know is optimal. Indeed, CG is very well suited for the molecular 
dynamics algorithm which generates the ensemble of gauge fields in lattice QCD with overlap 
fermions. 

However, another important task is the computation of quark propagators which we consider 
here. This requires the solution of the linear systems of the type 1.1. One can still use CG on 
normal equations (CGNE) but this method may not be optimal in this case. In this paper we 
seek optimal solutions of the above system in the Krylov subspace: 

x e JC k = span{b, Db, . . . , D^b} (1.7) 

It is well known that the generalized minimal residual method (GMRES) is the optimal method 
that can be constructed in this subspace. However, it is well known as well that the underlying 
Arnoldi process produces long recurrences, a feature which limits the benefits of the GMRES 
algorithm for large problems. Nonetheless, Jagels and Reichel [8] found that for shifted unitary 
systems it is possible to construct an Arnoldi process with short recurrences. The method, the 
shifted unitary minimal residual algorithm (SUMR) was shown by Jiilich-Wuppertal group [9] 
to be the fastest linear solver for quark propagator computations. 

Here we propose an approach, which is based on a three-term recurrence of the unitary 
Arnoldi process. This process is used to construct two optimal Krylov subspace methods: the 
shifted unitary orthogonal method (SUOM) and its geometric optimal counterpart SHUMR. 
We show that this algorithm converges faster then SUMR for all lattices and quark masses 
considered in this paper. 

The paper is organised as follows: section 2 describes the new recurrence of the unitary 
Arnoldi process. Section 3 deals with the construction of the SUOM and SHUMR algorithms. 
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In section 4 we compare directly these algorithms to SUMR and other Krylov subspace meth- 
ods. Conclusions are drawn in section 5. 



2 A three-term recurrence for the Arnoldi unitary process 

Our aim in this section is to construct an orthogonal basis of the Krylov subspace 1.7 with 
as few operations as possible which will then be useful for the solution of the linear system 
1.1. As a starting point we use the well-known Arnoldi process [10], which is essentially a 
modified Gram-Schmidt process, given in Algorithm 1 . After k steps of this algorithm, the next 

Algorithm 1 Arnoldi process 



for k — 1, . . . do 

w = Vq k 

for j — 1, . . . , k do 

h k j = q*w 
w := w — qjhjk 
end for 



if hk+i,k — then 

stop 
end if 

q k+ i = w/h k+1<k 
end for 

unnormalized Arnoldi vector is given by: 



The right hand side contains a linear combination over all computed Arnoldi vectors. This 
makes the overall complexity of the algorithm to grow quadratically with k. Moreover, the 
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H=b/\\b\\ 2 




k 




(2.1) 



computer memory grows linearly with k, a requirement which is often prohibitive and thus 
undesirable for large problems like ours. 

Fortunately, for unitary matrices one can do better. After k steps of Algorithm 1 we can 
write: 

VQk = QkHk + h k+ i ik q k+ iel = Q k+ \H k (2.2) 

where Q k = [qi, ... , q k ] is the matrix of orthonormal Arnoldi vectors, H k is an k x k upper 
Hessenberg matrix and H k is a k + 1 x k matrix obtained by appending the row h k+ i jk e k to the 
matrix H k . 

Proposition 2.1 H k has orthonormal columns. 

Proof. Multiplying both sides of 2.2 from left by Q* k we get: 

QlVQ k = H k (2.3) 
Multiplying both sides of 2.2 from left again but now by (VQ k )* we get: 

h = Q* k V*Q k H k + h k+1 , k (q* k+1 VQ k )*el (2.4) 
From 2.3 and 2.2 one gets Q* k V*Q k = HI and q* k+l VQ k = h k+1 e^. Thus, we find that: 

h = HlH k + h 2 k+lk e k el = E* k E k (2.5) 
which proves the proposition. □ 
Corollary 2.2 From 2.5 it is clear that H k is a unitary matrix if its last column is normalised. 

This property was used be Jagels and Reichel to write H k as a product of k elementary 
Givens rotations, which are then exploited to construct a coupled two-term recurrence Arnoldi 
algorithm (see Algorithm 3.1 of [8]). They note that solving the coupled recurrences would still 
lead to an algorithm where all vectors are required. 

In contrast to their work, we give an algorithm which is defined by short recurrences. We 
start by noting that the LU-decomposition of the upper Hessenberg matrix H k can be written in 
the form: 

H k = LfeC/- 1 (2.6) 
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where L k is a lower bidiagonal matrix and U k is upper triangular. For our convenience, we take 
the diagonal elements of U k to be one. Substituting this decomposition into 2.5, multiplying the 
result from the right by U k and from the left by 17% we get: 

U* k U k = L* k L k + h 2 k+lik e k el (2.7) 

Since the right hand side is a tridiagonal Hermitian matrix so must be the left hand side. Hence, 
U k should be upper bidiagonal. This decomposition, eq. 2.6 was used for the first time by 
Rutishauser to compute the eigenvalues of orthogonal Hessenberg matrices [1 1]. We can easily 
extend this decomposition for the matrix H k : 

H k = L k U k 1 , L k = L k + l k+ i jk e k (2.8) 

which gives h k +i,k = h+i,k- 

Multiplying both sides of 2.2 by U k we get: 

VQ k U k = QkL k + h+i,kQk+i^ k ■ (2.9) 
This way, the next Arnoldi vector can be computed using: 

k+i,k Qk+i = (V - l k kl)qk + Vqk~\Uk~\,k (2.10) 

where 

u k -i,k = — r~r, ( 2 - n ) 

Q k -i v< ik-i 

and 

hk = q* k Vq k + q* k V q k -iu k -i, k (2.12) 

These expressions allow us to construct the three-term recurrence Arnoldi unitary process as 
given in Algorithm 2. 

Note that the algorithm needs an additional inner product than the usual (eg. Lanczos) three- 
term recurrence algorithms. The reason is the appearance of the matrix V both in the diagonal 
and subdiagonal terms of 2.10. The algorithm is equivalent to the normal Arnoldi algorithm in 
exact arithmetic as the following proposition shows: 
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Algorithm 2 Arnoldi unitary process 



u 01 = 0, qi =b/\\b\ | 2 
for k — 1, . . .n do 

w fc = Vg fc 

Mfe-i,fe = -{(fk-iWk) I '{ql^Wk-i) (for k> 1) 
4fc = g^Wfe + q* k w k ^iu k ^ ltk 
w k+ i =w k - l kk q k + M fe _i ifc w fc _i 

4+l,fc = ||Wfe+l||2 

if h+i,k = then 

stop 
end if 

Qk+l = Wk+i/lk+l,k 

end for 



Proposition 2.3 Given qi and V unitary the unitary Arnoldi process, Algorithm 2 generates the 
orthonormal vectors Qk, the lower and upper bidiagonal matrices L k , Uk such that L^-JJ^ 1 is 
a matrix with orthonormal columns. 

Proof. The algorithm produces orthonormal vectors Q k and matrices L k and U k such that equa- 
tions 2.9 is satisfied. Multiplying both sides of these equations by U^ 1 we get new equations 
where L k U k l is upper Hessenberg. From Proposition 2.1 follows that H k and thus L^U^ 1 is 
a matrix with orthonormal columns. □ 

Remark 2.4 Both Arnoldi processes produce the same basis for the Krylov subspace but Algo- 
rithm 2 is more efficient: its complexity is linear in k and its computer memory requirement is 
constant at each step k. 
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3 Optimal Krylov subspace methods for the overlap inver- 
sion 



The overlap operator is a non-Hermitian matrix operator. For such matrices GMRES is knwon 
to be the optimal algorithm. We call this class of algorithms geometrically optimal. Another 
algorithm which can be used for such problems is the full orthogonalisation method (FOM). In 
this case the k th residual vector, 



lies orthogonal to the Krylov subspace KL k . Because of this algebraic property we call this class 
of algorithms algebraically optimal. It can be shown that when the norm-minimising process 
of GMRES is converging rapidly, the residual norms in the corresponding Galerkin process of 
FOM exhibit similar behaviour [12]. 

Both methods GMRES and FOM use the Arnoldi process to generate the iterates. In this 
section we will use the Arnoldi unitary process to construct two algorithms: the first one is the 
specialisation of FOM to shifted unitary systems, whereas the second is the specialisation of 
GMRES to these systems. 

3.1 A full orthogonalisation strategy 

We seek the solution in the Krylov subspace spanned by the Arnoldi vectors, i.e. 



r k = b- Dx k 



(3.1) 



x k = QkVk, Vk e C 



■k 



(3.2) 



Assuming for simplicity that ||6|| = 1, and using eq. 3.1 the residual vector will be, 



r k = Qkei - DQ k y k 



(3.3) 



Using the definition of D, eq. 1 .2 and the relation 2.9 we get: 



rk = Qkei - Qk+i(cih + c 2 L k U k )y k 



(3.4) 
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where t k is obtained by appending a row of k zeros to the unit matrix 1. Projecting this equation 
onto the Krylov subspace K k one has: 

o = e 1 - (cil fc + c 2 L k U k 1 )y k (3.5) 

or equivalently: 

(dU k + c 2 L k )z k = ei, 2 fc = U k l y k (3.6) 
Note that the matrix on the left hand side is tridiagonal and we can write: 

T k z k = ei, T k = c x U k + c 2 L k (3.7) 

The LU decomposition of the matrix T k is denoted by T k = L k U k . Hence the solution can be 
written as: 

z k = U k 1 L k 1 e 1 = U k - 1 ^ l61 j = i^ k '^ + a k tJ k l e k (3.8) 

where 

a k = elT-'e, = elL^e, (3.9) 

Then from x k = Q k U k z k with Q k U k = [Q fc _iC/ fe _i, g fc + gjfc-iUfc-i,fc] and from the recurrence 
for eq. 3.8 we get: 

Xfc = Q k -xU k -\Zk-\ + «fc QkU k U k x e k (3.10) 

Finally, denoting, 

w k = Q k U k (j~ l e k (3.11) 

we have: 

x fc = rc fc _i + a fc Wfe (3.12) 
Using this result and the definition of the residual vector, eq. 3.1 we get: 

r k = r k _! - a k Dw k (3.13) 
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Using matrices L k , U k it is easy to show (see Appendix A) that the following recurrences hold: 

7k -1 

w k = qk + 9fc-iWfc-i,fc + ~ (3.14) 

ffc-l,k-l 

«k = ^^-a k -i (3.15) 
'kk 

4fc = ci + c 2 /fcfc - ^ l7fc 1 (3.16) 

tfc-i,k-i 

where = -c 2 k+i,k and 7 fc = -c^k+i- 

In this way we have specified the iterations for this linear solver, which is called the shifted 
unitary orthogonal method or SUOM. Below we give the MATLAB function SUOM.m. 2 The 
input is the right hand side vector b, the exact solution x , the unitary matrix V, the real constants 
Ci and c 2 , the tolerance tol and the maximum number of iterations imax. The output is the error 
norm history rr and the solution x. 

function [rr,x] = SUOM(b , xO ,V, cl , c2 , tol , imax ) ; 
b=b(:); 

N=max( size (b ) ) ; 
vzero = zeros (N, 1 ) ; 
x=vzero ; 
r=b; 

rho = norm( r ) ; 
rnorm = rho ; 
rr=norm(xO ) ; 
alpha = rho; 
ul2=0; 
beta = 1 ; 
LI 1 _tilde =1 ; 
q=r / rho ; 

2 An e-copy of the function can be downloaded form the hep-lat posting of the paper: MATLAB codes are left 
intact when included in the body of the LaTeX source. 
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q_old = vzero ; v_old = vzero ; w_old=vzero ; s_old = vzero; 
counter = 1; 

while ( (rnorm > tol ) & ( counter <=imax ) ); 
v=V*q ; 

if ( counter > 1) , 

ul2=-(q_old '*v)/( q_old '* v_old ); 
end 

gamma=— cl *ul2 ; 
Lll=(q'*v)+ul2*(q'* v_old ); 
q_tilde=v— LI 1 *q+ul2 * v _old ; 
L21=norm( q_tilde ) ; 
if (L21<=tol), break, end; 
w=q+q_old *ul2+w_old *gamma/ Lll_tilde; 

s=cl*(q+q_old*ul2)+c2*(v+v_old*ul2)+s_old *gamma/ L 1 1 _tilde ; 
Lll_tilde = cl + c2*Lll— beta *gamma/ L 1 1 _tilde ; 
alpha = alpha*beta/Lll_tilde ; 

x=x+w* alpha ; 
r=r— s * alpha ; 

q_old=q; v_old=v; w_old=w; s_old = s; 
q=q_tilde /L21 ; 
beta=-c2*L21 ; 
rnorm =norm( r ) ; 
rr = [ rr ;norm(xO— x ) ] ; 
counter ++; 
end 

Note that the code can be trivially changed in order to get the residual error norm instead 
of the error norm, or to use x as a starting guess by defining the starting residual error as 
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r = b — Dxq. 

Another advantage of our MATLAB code is its easy inclusion into a C++ code which uses 
uBLAS libraries for inner products and Euclidean norms 3 . We have used these libraries to 
construct such a C++ function for use with overlap fermions. In this case, the matrix- vector 
multiplication is an external routine which applies the inverse square root of D^D W to the 
current Arnoldi vector q followed by a D w multiplication. 



3.2 A minimal residual strategy 

As in the previous case our starting point is the Krylov subspace K, k . But instead of projecting 
the residual vector we seek the minimum of its 2-norm on this subspace. The solution in this 
case is denoted by x k and is written formally as a solution of a Least Squares Problem (LSP): 

x k = arg min \\b — Dx\\ 2 (3.17) 



Requiring x = Q k y and using eq. 3.4 we get: 



y k = arg mm 

y£C k 



Qk+i [ei - (cil fc + c 2 L k U k ^y] 



(3.18) 



Since Arnoldi vectors are orthonormal, the matrix Q k+ \ can be ignored and we end up with a 
much smaller LSP. Note that the matrix H k = L k U^ 1 has orthonormal columns (see Proposition 
2. 1 ), a property which can be used to get a short recurrence algorithm as in the case of the SUMR 
algorithm [8] . However, as in the case of SUOM, we follow a strategy that involves a tridiagonal 
matrix in the LSP. This way the solution of the smaller problem is given by: 



z k = arg mm 

zec fe 



ei - (ciC/ fc + c 2 L k )z 



, yk = U k z k (3.19) 

2 



where 



> vVk c-iLk ■ | A | //,■ i' = r-i h- h /. (3-20) 

In order to compute the iterates from those of the SUOM algorithm we split the solution 
vector as follows: 

z k = z k + £ k (3.21) 



http : / / www . boost . org/ libs/ numeric/ ublas /doc/ overview . htm 
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Using equations 3.7-3.9 we get: 



£ fe = arg min va k e k+l + T fc £ 
We solve this LSP by QR factorization of the matrix T k : 



(3.22) 



Tk — OlR k , Rk 



o 



(3.23) 



with O k being afc+lx/c + l unitary matrix and R k an upper tridiagonal matrix. Therefore we 
have: 

ik = arg min va k O k e k+ i + R k £ (3.24) 

As it is usual in this case, the unitary matrix can be constructed using Givens rotations. At step 
k one can express it in the form: 



O k -i 
Ok = Gk | | , Gk 

1 



h-i 



\ 

Ck s k 



2.1 1 2 -. 
» c k + \ s k\ = 1 



(3.25) 



From this it is clear that: 



Ok^k+i — Sfcefe + c k e k +i 



which gives: 



mm 



voikO k ek + i + R k £ 



= min \\va k s k ek + Rk£\\ 2 + W®kCk\ 
2 §ec fc 



(3.26) 



(3.27) 



The right hand side is minimal if its first term is minimal. If we assume that R k has full rank 
then this term must vanish. In this case the solution is given by: 



6c = -Rk l CkVa k s k 



Using 3.18-3.21, the solution to the original problem can be written as: 



Xk = QkUkZk + QkUkik = x k + QkUkik = x k + x k 



(3.28) 



where 



Xk = -QkUkR^ekfotkSk 



(3.29) 



(3.30) 
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In order to simplify the expression we define the matrix P k : 



P k = [pi, ...,p k ] = QkU k R k 



-i 



(3.31) 



and denote uj k = va k s k . This way we have: 



Xk — —^kPk 



(3.32) 



To complete the algorithm one should write the recurrence on p k . Multiplying both sides of 
3.31 by R k from the right we get P k R k = Q k U k , which gives: 



where by /j, k , e k and 6 k we denote the only non-zero entries of the last column of R k . From this 
equation we get: 



This completes the description of the method, which we call SHUMR in order to distinguish 
it from the SUMR algorithm. The details of the QR decomposition are given in Appendix B. 
The MATLAB code of the algorithm is listed below. It differs from the SUOM code with lines 
between "start added lines" and "end added lines". One can make here the same remarks as 
made in the case of the SUOM.m function. In particular, the code can be easily modified into a 
C++ code using uBLAS libraries. 

function [rr,x] = SHUMR(b , xO , V, cl , c2 , tol , imax ) ; 
b=b ( : ) ; 

N=max( size (b ) ) ; 
vzero = zeros (N, 1 ) ; 
x=vzero ; 
r=b; 

rho = norm(r); 
rnorm=rho ; 



Pkl^k + Pk-i£k + Pk-2&k = qk + qk-iUk-i,k 



(3.33) 



Pk = (Qk + 9fc-iWfc-i,fc - Pk-i£k ~ Pk-20k)/Hk 



(3.34) 
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r r =norm ( xO ) ; 

alpha = rho ; 

ul2=0; 

beta = 1 ; 

L 1 1 _tilde =1 ; 

q=r / rho ; 

q_old = vzero ; v_old = vzero ; w_old=vzero ; s_old = vzero; 



% start added lines 

c_kml=l; s_kml=0; c_km2=0; s_km2 = 0; 

pl = vzero ; p2=vzero ; Dpl=vzero ; Dp2=vzero ; 

% end added lines 



counter = 1; 

while ( (rnorm > tol) & ( counter <=imax ) ); 
v=V*q ; 

if ( counter > 1) , 

ul2=-(q_old '* v )/( q_old '* v_old ) ; 
end 

gamma=— cl *ul2 ; 
Lll=(q'*v)+ul2*(q'* v_old ); 
q _t i Id e =v— LI 1 *q+ul2 * v _old ; 
L21=norm( q_tilde ) ; 
if (L21<=tol), break, end; 
w=q+q_old *ul2+w_old*gamma/ L 1 1 _tilde ; 

s=cl*(q+q_old*ul2)+c2*(v+v_old*ul2)+s_old *gamma/ L 1 1 _tilde ; 
Lll_tilde = cl + c2*Lll— beta *gamma/ L 1 1 _tilde ; 
alpha = alpha*beta/Lll_tilde ; 



x=x+w* alpha ; 
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r=r— s * alpha ; 

q_old=q; v_old=v; w_old=w; s_old = s; 
q=q_tilde /L21 ; 
beta=-c2*L21 ; 
rnorm=norm( r ) ; 

% start added lines 
tll=cl+c2*Lll ; 

mu=tl 1 *c_kml+gamma*conj (s_kml )*c_km2 ; 
nu=c2*L21 ; 
if (mu != 0), 

c_k=abs(mu)/sqrt( abs (mu)* abs (mu)+ abs (nu )* abs (nu ) ) ; 

s_k = conj ( c_k*nu /mu) ; 
else 

c_k = 0; 

s_k = l; 
end 

omega=nu* alpha * s_k ; 

mu_k= c _k *mu+ s _k * nu ; 

eps = tl 1 *s_kml— gamma *c_kml *c_km2 ; 

theta=— gamma* s_km2 ; 

p = (q+q_old *ul2— pi *eps— p2* thet a )/ mu_k ; 

Dp=(cl*(q+q_old*ul2) + c2*(v+v_old *ul2)-Dpl*eps— Dp2* thet a )/ mu_k ; 

rnorm_p=norm( r+omega*Dp ) ; 
xp=x— omega *p ; 

c_km2=c_kml ; s_km2=s_kml ; p2=pl ; Dp2=Dpl ; 
c_kml = c_k; s_kml = s_k; pl=p; Dpl=Dp; 
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% end added lines 



r r = [ r r ; norm ( xO— xp ) ] ; 
counter ++; 
end 

3.3 A numerical example 

We note that SUMR and SHUMR algorithms differ in the underlying Arnoldi process: the 
SUMR algorithm uses two coupled two-term recurrences which, when are solved, yield the 
usual long recurrence of the Arnoldi algorithm; the SHUMR algorithm uses a three-term recur- 
rence. In principle, it is possible to compare theoretically the effect of such a difference, but this 
goes beyond the purpose of this paper. We have chosen instead a direct numerical comparison 
in case of overlap fermions, since this is of great importance for practical lattice computations. 
Before we do so we give a first example in the case of a small, i.e. 200 x 200 unitary matrix. 

The example is similar to Example 2 of the paper of Jagels and Reichel [8]. Let W unitary 
matrix of the order 200 resulting from the QR decomposition of a matrix with random elements 
in the interval (0, 1). Let A be a diagonal unitary matrix with elements \ k = e l9k where 

9 k = n(k-l)/6, l<k<6 

and the rest is randomly distributed in the interval (— 7r/4, 7r/4). Then, the unitary matrix V is 
defined by: 

V = WAW* 

We have tested SHUMR,SUOM,SUMR algorithms for solving the linear system Ax = b 
where A — c\\ + c 2 V with c\ = 1.05 and c 2 = 1 and random right hand side b. The results are 
shown in Figure 1 . We observe that SHUMR and SUOM lie on top of each other (in this scale) 
and converge linearly until they stagnate below 10~ 14 . On the other hand the convergence rate 
of the SUMR algorithm slows down when the value of the error norm is around 10~ 10 . It is not 
clear why SUMR differs in this way from SHUMR and SUOM in this particular example. 
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Figure 1 : Comparison of SHUMR,SUOM,SUMR algorithms for solving a small shifted unitary 
linear system as described in the text. 

4 Comparison of algorithms for the overlap inversion 

In this section we compare the convergence of SUMR, SUOM and SHUMR algorithms in 
the case of overlap fermions. SUMR and SHUMR are geometrically optimal algorithms for 
shifted linear systems, whereas SUOM is algebraically optimal in the sense defined above. For 
completeness we display the convergence of Conjugate Residuals (CR), Conjugate Gradients 
on Normal Equations (CGNE) and a special variation of CGNE which we call CG-CHI. The 
latter exploits the fact that D*D is block diagonal and solves simultaneously the two decoupled 
chiral systems. 

Note that the computation of D as applied to a vector is a numerical problem, which is by 
now well researched. A good review of these methods can be found in [13]. We use the Lanczos 
method [14, 15], in the double pass version and without i/yi/-eigenvalue projection. 

In the figures below we show the convergence of algorithms as a function of Wilson matrix- 
vector multiplication number on 8 3 16 quenched lattices at various couplings and quark masses. 
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Figure 2 compares all above mentioned algorithms for quark mass m = 0.05 at (3 — 6 and 
(3 = 5.7. The first observation is that SHUMR, SUMR, SUOM and CR are more efficient than 
CGNE and CG-CHI algorithms. This is observed by the other groups as well [9]. Hence, we 
decided not run these algorithms further for smaller quark masses. The second observation is 
that SUMR, SUOM and CR converge neck-to-neck with CR being slightly worse at (3 — 6. The 
third observation is that SHUMR converges 10 - 15% faster than SUMR and SUOM. 

Figure 3 compares the algorithms remaining in race for the quark mass m = 0.01at,3 = 6 
and (3 = 5.7. At this lighter mass we observe a superlinear convergence rate of SHUMR, SUMR 
and SUOM: the rate increases around residual norm 10~ 3 at f3 — 6 and around 10~ 5 at f3 — 5.7. 
This is to be contrasted to the linear convergence of CR. A second observation at this mass is 
the emergence of the pattern that SHUMR converges faster than SUMR and the latter converges 
faster than SUOM. This pattern is confirmed in Figure 4 where the quark mass is lowered to 
m = 0.005. 

Hence, the best algorithms are the optimal algorithms SHUMR and SUMR with SHUMR 
converging 10 — 15% faster in all cases. Although both algorithms minimise the residual vector 
norm they differ in the underlying Arnoldi process. The nature of short recurrences employed by 
SHUMR may impact numerical properties of the algorithm. Hence, generated Krylov subspaces 
are different and we may conclude that SHUMR explores it more efficiently. However, at 
present we have no theoretical tool to characterise the difference. 

5 Conclusions 

In this paper we have presented two iterative solvers which are based on a new Arnoldi type 
algorithm for shifted unitary systems. The SUOM solver constructs residual vectors which are 
orthogonal to the Krylov subspace. The SHUMR solver minimises the residual vector over the 
generated Krylov subspace. Both algorithms are short recurrence specialisations of FOM and 
GMRES for shifted unitary systems. Taking into account the SUMR algorithm of Jagels and 
Reichel [8], the short recurrence algorithms for such systems are hardly a new result. But, it is 
somewhat surprising that the SHUMR algorithm performs better than SUMR on the examples 
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Figure 2: Convergence history of various solvers on background gauge fields at (3 = 6 (upper 
part) and (3 = 5.7 (lower part) and quark mass m = 0.05 
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Figure 3: Convergence history of SUMR, SUOM, SHUMR and CR on background gauge fields 
at (3 = 6 (upper part) and (3 = 5.7 (lower part) and quark mass m = 0.01 
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Figure 4: Convergence history of SUMR, SUOM and SHUMR on background gauge fields at 
(3 = 6 and quark mass m = 0.005 



shown in this paper. In particular, we presented a new Arnoldi unitary process, Algorithm 2, 
which is intrinsically a short recurrence process, a result which appears for the first time. 

On the application level we conclude that SHUMR algorithm outperforms all other knwon 
iterative solvers for quark propagator computations with overlap fermions. The MATLAB codes 
of SUOM and SHUMR algorithms given here allow an easy conversion to other object oriented 
code like C++ which uses uBLAS libraries. 
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Appendix A 



In order to derive the recursions 3.14-3.16 we partition the tridiagonal matrix T k in the form: 

Lk-iUk-i — 7fc-i e fc-i 

—Ph-l^k-l Ci + C 2 hk 

from where the LU factors are found to be: 



Lk-i \ ~ / U k -i —^ k -\L k \e k -\ 

-ft-ieLi^T-i hk) \ 1 

with 

4fc = ci + c 2 / fc fc - /^fc-iTfc-ie^iT^efc-i (5.1) 

Their inversion gives: 



i) From 3.1 1 and applying e k to the right of U^ 1 one gets the first recursion of 3.14: 

Wfc = qk + q k -iu k - ljk + ~ w k -i 

*fe— i,fc— i 

where the initial vector is set to w\ — q±. 

ii) Using 3.9, applying to the left and e\ to the right of L^ 1 one gets the second recursion 
of 3.15: 

Pk-i 
a k = -= — a k - X 

hk 

and with ax = ||6|| 2 // n . 

iii) Finally, observing that: 

J- = e k T k e k 
hk 

and and using 5.1 gives the third recursion of 3.16: 

Pk-ilk-i 



hk = ci + c 2 l 



kk 



h-i,k-i 
with l u = ci + c 2 lu. 
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Appendix B 



In order to complete the derivation of the SHUMR algorithm one has to specify the Givens 
matrix G k and the last column of R k . Let t k be the last column of T k and Jx k be the last column 
of R k . Then, we have: 

fik — O k t k , t k = ve k+ i + t kk e k — 7 fc _ie fc _i 
where t kk — c± + c 2 l kk . From the definition of O k , eq. 3.25 it is clear that: 




(5.2) 



and from eq. 3.26: 



O k -\e k — s k _ie k _i + Cfe_ie fc 



(5.3) 



Using 3.25 again and applying the above result for O k ^k-\ one finds: 



O k -\e k ^i — G k -\ 




— s k -2G k -\e k -2 + Cfc_2Gfc_iefc_i 



Since G k -ie k - 2 = e fc _ 2 and G fe _ie fc _i = c k ^k-\ - s k -ie k one gets: 



O k -\e k ~\ — Sfc_ 2 e / fc_ 2 + Cfc_ 2 Cfc_ie fe _ 1 — c k ^ 2 s k -ie k 



(5.4) 



Substituting 5.3 and 5.4 to 5.2 one obtains: 



\ 



\ 



v 



J 



where 



tkkCk-l + 7fc-lCfc-2Cfe_l 



£k — tkkSk-l — lk-\Ck-2Ck~l 



Ok — — lk-l-Sk-2 
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with c = 1, s = c_i = s_i = 70 = 0. 

The values of // fc and s fe , c fc are determined by the condition: 



Gk 



V 



^k e k-l + 







or equivalently by: 



-Sfc Ck] \v, 







It is easy to see that = c fc/ u + s fc z/ and s fc = c k v/ ^. Using c| + |s fc | 2 = 1 one has: 

Cfc = 



2,| 2 

A* I + \v\ 



For = one has c k = and s k = 1. 
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